1 Introduction


In many natural learning problems, the learner has the
ability to act on its environment and gather data that
will resolve its uncertainties. Most machine learning
research, however, treats the learner as a passive recipient
of data and ignores the role of this "active" component
of learning. In this paper we employ techniques from
the field of Optimal Experiment Design (OED) to guide
the actions of a learner, selecting actions/queries that
are statistically expected to minimize its uncertainty and
error.

1.1  Active  learning 

Exploiting the active component of learning typically
leads to improved generalization, usually at the cost of
additional computation (see Figure 1) [Angluin, 1982;
Cohn et al., 1990; Hwang et al., 1991].¹ There are two
common situations where this tradeoff is desirable. In
many situations the cost of taking an action outweighs
the cost of the computation required to incorporate new
information into the model. In these cases we wish to
select queries judiciously, so that we can build a good
model with the fewest data. This is the case if, for
example, we are drilling oil wells or taking seismic
measurements to locate buried waste. In other situations the
data, although cheap, must be chosen carefully to
ensure thorough exploration. Large amounts of data may
be useless if they all come from an uninteresting part
of the domain. This is the case with learning control
of a robot arm: exploring by generating random motor
torques cannot be expected to give good coverage of the
domain.

As computation becomes cheaper and faster, more
problems fall within the realm where it is both desirable
and practical to pursue active learning, expending more
computation to ensure that one's exploration provides
good data. The field of Optimal Experiment Design,
which is concerned with the statistics of gathering new
data, provides a principled way to guide this exploration.
This paper builds on the theoretical results of Fedorov
[1972] and MacKay [1992] to empirically demonstrate
how OED may be applied to neural network learning,
and to determine under what circumstances it is an
effective approach.

The remainder of this section provides a formal
problem definition, followed by a brief review of related work
using optimal experiment design. Section 2 differentiates
several classes of active learning problems for which OED
is appropriate. Section 3 describes the theory behind
optimal experiment design, and Section 4 demonstrates
its application to the problems described in Section 2.
Section 5 considers the computational costs of these
experiments, and Section 6 concludes with a discussion of
the results and implications for future work.


¹In some cases active selection of training data can
sharply reduce worst case computational complexity from
NP-complete to polynomial time [Baum and Lang, 1991], and
in special cases to linear time.


[Figure 1 panels: Passive learning; Active learning]

Figure 1: An active system will typically evaluate/train
on its data iteratively, determining its next input based
on the previous training examples. This iterative
training may be computationally expensive, especially for
learning systems like neural networks where good
incremental algorithms are not available.

1.2  Problem  definition 

We consider the problem of learning an input-output
mapping $X \to Y$ from a set of $m$ training examples
$\{(x_i, y_i)\}_{i=1}^m$, where $x_i \in X$, $y_i \in Y$.

We denote the parameterized learner $f_w(\cdot)$, where
$\hat{y} = f_w(x)$ is the learner's output given input $x$ and
parameter vector $w$. The learner is trained by adjusting
$w$ to minimize the residual

$$S^2 = \frac{1}{2m} \sum_{i=1}^m (f_w(x_i) - y_i)^T (f_w(x_i) - y_i)$$

on the training set. Let $\hat{w}$ be the
weight vector that minimizes $S^2$. Then $\hat{y} = f_{\hat{w}}(x)$ is
the learner's "best guess" of the mapping $X \to Y$: given
$x$, $\hat{y}$ is an estimate of the corresponding $y$.

At each time step, the learner is allowed to select a
new training input $\tilde{x}$ from a set of candidate inputs $\tilde{X}$.
The selection of $\tilde{x}$ may be viewed as a "query" (as to an
oracle), as an "experiment," or simply as an "action."
Having selected $\tilde{x}$, the learner is given the corresponding
$\tilde{y}$, and the resulting new example $(\tilde{x}, \tilde{y})$ is added to
the training set. The learner incorporates the new data,
selects another new $\tilde{x}$, and the process is repeated.
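The query loop just described can be sketched in a few lines. This is a minimal illustration, not the procedure used in the experiments: the names `active_learning_loop`, `fit_mean`, `novelty` and `oracle` are hypothetical, the "model" is just the mean of the observed outputs, and the scoring rule (distance to the nearest previous query) is a stand-in for the OED criterion developed in Section 3.

```python
import random

def active_learning_loop(candidates, oracle, fit, score, n_queries, seed=0):
    """Generic query loop: select an input, observe its output, retrain, repeat."""
    rng = random.Random(seed)
    training_set = []
    model = fit(training_set)
    for _ in range(n_queries):
        # Score every candidate under the current model; ties are broken randomly.
        best = max(candidates,
                   key=lambda x: (score(model, training_set, x), rng.random()))
        y = oracle(best)                 # the "query", "experiment", or "action"
        training_set.append((best, y))   # add the new example to the training set
        model = fit(training_set)        # incorporate the new data
    return model, training_set

def fit_mean(training_set):
    """Placeholder learner (assumption): predicts the mean observed output."""
    if not training_set:
        return 0.0
    return sum(y for _, y in training_set) / len(training_set)

def novelty(model, training_set, x):
    """Placeholder score (assumption): distance to the nearest queried input."""
    if not training_set:
        return float("inf")
    return min(abs(x - xi) for xi, _ in training_set)

def oracle(x):
    """Hypothetical target mapping standing in for the environment."""
    return 2.0 * x + 1.0

candidates = [i / 10 for i in range(11)]
model, data = active_learning_loop(candidates, oracle, fit_mean, novelty, n_queries=5)
```

Swapping `novelty` for an estimate of expected variance reduction would turn this skeleton into the greedy scheme discussed later.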

The goal is to choose examples that minimize the
expectation of the learner's mean squared error
$E_{MSE} = \langle (f_{\hat{w}}(x) - y)^2 \rangle_X$, where
$\langle \cdot \rangle_X$ represents the
expected value over $X$. In contrast to some other learning
paradigms [Valiant, 1984; Blumer et al., 1986], we
will assume that the input distribution $P_X$ is known.²
Below, we present several example problems:

Example 1: mapping buried waste. Consider a
mobile sensor array traversing a landscape to map out
subsurface electromagnetic anomalies. Its location at
time $t$ serves as input $x_t$, and the instrument reading
at that location is output $y_t$. At the next time step, it
can choose its new input $\tilde{x}$ from any location contiguous
to its present position.

Example 2: robot arm dynamics. Consider learning
the dynamics of a robot arm. The input is the
state-action triplet $x_t = \{\theta_t, \dot{\theta}_t, \tau_t\}$, where $\theta_t$ and $\dot{\theta}_t$
are the arm's joint angles and velocities, respectively,
and $\tau_t$ is the torque applied at time $t$. The output
$y_t = \{\theta_{t+1}, \dot{\theta}_{t+1}\}$ is the resulting state. Note that here,
although we may specify an arbitrary torque $\tau_t$, the rest
of the input, $\{\theta_t, \dot{\theta}_t\}$, is determined by $y_{t-1}$.

We emphasize that while the above problem definition
has wide-ranging application, it is by no means
all-encompassing. For some learning problems, we are not
interested in the entire mapping $X \to Y$, but in finding
the $x$ that maximizes $y$. In this case, we may rely on
the broad literature of optimization and response surface
techniques [Box and Draper, 1969]. In other learning
problems there may be additional constraints that
must be considered, such as the need to avoid "failure"
states. If the learner is required to perform as it learns
(e.g. in a control task), we may also need to balance
exploration and exploitation. Such constraints and costs
may be incorporated into the data selection criterion as
additional costs, but these issues are beyond the scope
of this paper. In this paper we assume that the cost of
the query $\tilde{x}$ is independent of $\tilde{x}$, and that the sole aim
of active learning is to minimize $E_{MSE}$.

1.3  Related  work  with  optimal  experiment 
design 

The literature on optimal experiment design is immense
and dates back at least 50 years. We will just mention
here a few closely related theoretical results and empirical
studies; the interested reader should consult Atkinson
and Donev [1992] for a survey of results and applications
using optimal experiment design.

A canonical description of the theory of OED is given
in Fedorov [1972]. MacKay [1992] showed that OED
could be incorporated into a Bayesian framework for
neural network data selection and described several
interesting optimization criteria. Sollich [1994] considers
the theoretical generalization performance of linear
networks given greedy vs. globally optimal queries and
varying assumptions on teacher distributions.

Empirically, optimal experiment design techniques
have been successful when used for system identification
tasks. In these cases a good parameterized model
of the system is available, and learning involves finding
the proper parameters. Armstrong [1989] used a form
of OED to identify link masses and inertial moments
of a robot arm, and found that automatically generated
training trajectories provided a significant improvement
over human-designed trajectories. Subrahmonia et
al. [1992] successfully used experiment design to guide
exploration of a sensor moving along the surface of an
object parameterized as an unknown quadric.

²Both assumptions are reasonable in different situations;
if we are attempting to learn to control a robot arm, for
example, it is appropriate to assume that we know over what
range we wish to control it.

Empirical work on using OED with neural networks is
sparse. Plutowski and White [1993] successfully used it
to filter an already-labeled data set for maximally
informative points. Choueiki [1994] has successfully trained
neural networks on quadratic surfaces with data drawn
according to the D-optimality criterion, which is
discussed in the appendix.

2  Learning  with  static  and  dynamic 
constraints 

As seen in Section 1.2, different problems impose
different constraints on the specification of $\tilde{x}$. These
constraints may be classified as being either static or
dynamic, and problems with dynamic constraints may be
further divided according to whether or not the dynamics
of the constraints are known a priori. The remainder
of this section elaborates on these distinctions.
Examples, with experimental results for each category, will be
given in Section 4.

2.1  Active  learning  with  static  constraints 

When a learner has static input constraints, its range
of choices for $\tilde{x}$ is fixed, regardless of previous actions.
Examples of problems with static constraints include
setting mixtures of ingredients for an industrial process or
selecting places to take seismic or electromagnetic
measurements to locate buried waste.

The bulk of research on active learning per se has
concentrated on learning with static constraints. In
this setting, active learning algorithms are compared
against algorithms learning from randomly chosen
examples. In general, the number of randomly chosen
examples needed to achieve an expected error of no more
than $\epsilon$ scales as $O(\frac{1}{\epsilon} \log \frac{1}{\epsilon})$ [Blumer et al., 1989; Baum
and Haussler, 1989; Cohn and Tesauro, 1992; Haussler,
1992]. In some situations, active selection of training
examples can reduce the sample complexity to $O(\log \frac{1}{\epsilon})$,³
although worst case bounds for unconstrained querying
are no better than those for choosing at random
[Eisenberg and Rivest, 1990]. Average case analysis indicates
that on many domains the expected performance of
active selection of training examples is significantly
better than that of random sampling [Freund and Seung,
1993]; these results have also been supported by empirical
studies [Cohn et al., 1990; Hwang et al., 1991; Baum
and Lang, 1991].

A limitation of the active learning algorithms
mentioned above is that they are only applicable to specific
active learning problems: the algorithms of Cohn et al.
and Hwang et al. are limited to classification problems,
and Baum and Lang's algorithm is further limited to a
specific network architecture (single hidden layer with
sigmoidal units). The OED-based approach discussed
in this paper is applicable to any network architecture
whose output is differentiable with respect to its
parameters, and may be used on both regression and
classification problems.

³Consider cases where binary search is applicable.
Figure 2: In problems with dynamic constraints, the set
of candidate $\tilde{x}$ can change after each query. The inputs
accessible to the learner on the bottom depend on the
choice made for $x_t$.

2.2  Active  learning  with  dynamic  constraints 

In many learning problems, the constraints on $\tilde{x}$ are dynamic
and change over time. Inputs that are available to us on
one time step may no longer be available on the next.
Typically, these constraints represent some state of the
system that is altered by the learner's actions. Training
examples then describe a trajectory through state space.

In some such problems the dynamics of the constraints
are known, and we may predict a priori what constraints
we will face at time $t$, given an initial state and actions
$x_1, x_2, \ldots, x_t$. Consider Example 1, using a mobile
sensor array to locate buried waste. We can pre-plan the
course the vehicle will take, but its successive
measurements are constrained to lie in the neighborhood of
previous ones. Alternatively, consider learning the forward
kinematics of an arm: we specify joint angles $\theta$ in an
attempt to predict the arm's tip coordinates $\zeta$. Barring
any unknown obstacles, we can move from our current
position $\theta_t$ to a neighboring $\theta_{t+1}$, but cannot select an
arbitrary $\theta_{t+1}$ for the next time step.

A more common, and more difficult, problem is learning
when the dynamics of the constraints are not known,
and must be accommodated online. Learning the
dynamics of a robot arm
$\{\theta_t, \dot{\theta}_t, \tau_t\} \to \{\theta_{t+1}, \dot{\theta}_{t+1}\}$ is
an example of this type of problem. At each time step
$t$, the model input $\tilde{x}$ is a state-action pair $\{\theta_t, \dot{\theta}_t, \tau_t\}$,
where $\theta_t$ and $\dot{\theta}_t$ are constrained to be the learner's
current state. Until the action is selected and taken, the
learner does not know what its new state, and thus its
new constraints, will be (this is in fact exactly what it is
attempting to learn).

In most forms of constrained learning problems,
random exploration is a poor strategy. Taking random
actions leads to a form of "drunkard's walk" over $X$, which
can require an unacceptably large number of examples
to give good coverage [Whitehead, 1991].

In cases where the dynamics of the constraints are
known a priori, we can plan a trajectory that will
uniformly cover $X$ in some prespecified number of steps. In
general, though, we will have to resort to some online
process to decide "what to try next." Some successful
heuristic exploration strategies include trying to visit
unvisited states [Schaal and Atkeson, 1994], trying to
visit places where we perform poorly [Linden and Weber,
1993], taking actions that improved our performance
in similar situations [Schmidhuber and Storck, 1993], or
maintaining a heuristic "confidence map" [Thrun and
Moller, 1992]. Some researchers, in cases where the
exploration is considered a secondary problem, provide the
learner with a uniformly distributed training set, in
effect assuming the problem allows unconstrained querying
(e.g. Mel [1992]).

An important limitation of the above work with
dynamic constraints is that, for the most part, the methods
are restricted to discrete state spaces. Continuous state
and action spaces must be accommodated either through
arbitrary discretization or through some form of on-line
partitioning strategy, such as Moore's Parti-Game
algorithm [Moore, 1991]. The OED-based approach
discussed in this paper is, by nature, applicable to domains
with both continuous state and action spaces.

3  Data  selection  according  to  OED 

In this section, we review the theory of optimal
experiment design applied to neural network learning. As
stated in the introduction, our primary goal is to
minimize $E_{MSE}$. An alternative goal of system identification
is discussed briefly in the appendix, and other interesting
goals, such as eigenvalue maximization and entropy
minimization, may be found in Fedorov [1972] and MacKay
[1992].

Error minimization is pursued in the OED framework
by selecting data to minimize model uncertainty.
Uncertainty in this case is manifested as the learner's estimated
output variance $\sigma^2_{\hat{y}}$. The justification for selecting data
to minimize variance comes from the nature of MSE.
Defining $\bar{y}|x = \langle \hat{y}|x \rangle_Y$, mean squared error may be
decoupled into variance and bias terms:

$$\begin{aligned}
E_{MSE} &= \langle (\hat{y}|x - y|x)^2 \rangle_X \\
        &= \langle (\hat{y}|x - \bar{y}|x)^2 \rangle_X + \langle (\bar{y}|x - y|x)^2 \rangle_X \\
        &= \sigma^2_{\hat{y}} + \langle (\bar{y}|x - y|x)^2 \rangle_X .
\end{aligned}$$

Given an unbiased estimator, or an estimator for which
either the bias is small or independent of the training
set, error minimization amounts to minimizing the
variance of the estimator. For the rest of our computations
we will neglect the bias term, and select data solely to
minimize the estimated variance of our learner.⁴ An
in-depth discussion of the bias/variance tradeoff may be found
in Geman et al. [1992].
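The decoupling of mean squared error into variance and bias can be checked numerically. The sketch below is illustrative only: `decompose` is a hypothetical helper, and the prediction table is made-up toy data in which each row holds the predictions of a learner trained on a different training set.

```python
def decompose(preds, truth):
    """Split MSE into variance and squared bias.

    preds[k][j] is the prediction at input j from the learner trained on
    training set k; truth[j] is the target at input j.
    """
    n_sets, n_inputs = len(preds), len(truth)
    # Mean prediction at each input, averaged over training sets.
    mean_pred = [sum(p[j] for p in preds) / n_sets for j in range(n_inputs)]
    total = n_sets * n_inputs
    mse = sum((p[j] - truth[j]) ** 2
              for p in preds for j in range(n_inputs)) / total
    variance = sum((p[j] - mean_pred[j]) ** 2
                   for p in preds for j in range(n_inputs)) / total
    bias_sq = sum((mean_pred[j] - truth[j]) ** 2
                  for j in range(n_inputs)) / n_inputs
    return mse, variance, bias_sq

# Toy data (assumption): two training sets, two test inputs.
preds = [[1.0, 2.0], [3.0, 2.0]]
truth = [1.0, 3.0]
mse, variance, bias_sq = decompose(preds, truth)
```

For this table the variance is 0.5 and the squared bias is 1.0, which sum to the mean squared error of 1.5. Driving the variance term to zero would still leave the bias term untouched, which is why neglecting bias is an approximation rather than a guarantee.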


3.1  Estimating  variance 

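Estimates for $\sigma^2_{\hat{y}}$ may be obtained by adopting
techniques derived for linear systems. We write the network's
output sensitivity as $g(x) = \partial \hat{y}|x / \partial w = \partial f_{\hat{w}}(x) / \partial w$, and
define the Fisher Information Matrix to be

$$\begin{aligned}
A &= \frac{1}{S^2} \frac{\partial^2 S^2}{\partial w^2}
   = \frac{1}{S^2} \sum_{i=1}^m \left[ \frac{\partial \hat{y}|x_i}{\partial w} \frac{\partial \hat{y}|x_i}{\partial w}^T
     + (\hat{y}|x_i - y_i) \frac{\partial^2 \hat{y}|x_i}{\partial w^2} \right] \\
  &\approx \frac{1}{S^2} \sum_{i=1}^m g(x_i) g(x_i)^T . 
\end{aligned} \tag{1}$$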


The approximation in Equation 1 holds when the
network fits the data well or the error surface has relatively
constant curvature in the vicinity of $\hat{w}$. We may then
write the parameter covariance matrix as $\sigma^2_{\hat{w}} = A^{-1}$ and
the output variance at reference input $x_r$ as

$$\sigma^2_{\hat{y}|x_r} \approx g(x_r)^T A^{-1} g(x_r), \tag{2}$$

subject to the same approximations (see Thisted [1988]
for derivations).⁵ Note that the estimate $\sigma^2_{\hat{y}|x_r}$ applies
only to the variance at a particular reference point. Our
interest is in estimating $\sigma^2_{\hat{y}}$, the average variance over all
of $X$. We do not have a method for directly integrating
over $X$, and instead opt for a stochastic estimate based
on an average of $\sigma^2_{\hat{y}|x_r}$, with $x_r$ drawn according to $P_X$.
Writing the first and second moments of $g$ as $\bar{g} = \langle g(x) \rangle_X$
and $\overline{gg^T} = \langle g(x) g(x)^T \rangle_X$, this estimate can be computed
efficiently as

$$\langle \sigma^2_{\hat{y}} \rangle_X = \bar{g}^T A^{-1} \bar{g} + \mathrm{tr}(A^{-1} \overline{gg^T}), \tag{3}$$


where $\mathrm{tr}(\cdot)$ is the matrix trace. Instead of
recomputing Equation 2 for each reference point, $\bar{g}$ and $\overline{gg^T}$ may
be computed over the reference points, and Equation 3
evaluated once.
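As a concrete illustration of Equations 1 and 2, the sketch below estimates output variance for a toy learner that is linear in its parameters, $f_w(x) = w_0 + w_1 x$, so that $g(x) = (1, x)$. The model, the training inputs, the value of $S^2$ and all function names are assumptions made for the example. The average over reference points is taken directly, point by point, rather than through the moment-based shortcut.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def information_matrix(grads, s2):
    """Approximate A as (1/S^2) * sum_i g(x_i) g(x_i)^T (Equation 1)."""
    n = len(grads[0])
    return [[sum(gv[i] * gv[j] for gv in grads) / s2 for j in range(n)]
            for i in range(n)]

def output_variance(A, g_r):
    """Equation 2: g(x_r)^T A^{-1} g(x_r), without forming A^{-1} explicitly."""
    z = solve(A, g_r)
    return sum(gi * zi for gi, zi in zip(g_r, z))

def g(x):
    """Sensitivity of the toy model f_w(x) = w0 + w1 * x (assumption)."""
    return [1.0, x]

A = information_matrix([g(x) for x in (-1.0, 0.0, 1.0)], s2=1.0)
var_at_0 = output_variance(A, g(0.0))   # inside the training data
var_at_2 = output_variance(A, g(2.0))   # extrapolating beyond it
reference_points = [-1.0, -0.5, 0.0, 0.5, 1.0]
avg_var = sum(output_variance(A, g(x)) for x in reference_points) / len(reference_points)
```

The variance at the extrapolated input is several times larger than at the center of the training data, which is the signal the selection criterion exploits.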

⁴The bias term will in fact reappear as a limiting factor
in the experimental results described in Section 4.2.

⁵If the inverse $A^{-1}$ does not exist, then the parameter
covariances are not well-defined. In practice, one could use
the pseudo-inverse, but the need to do this arose very rarely in
our experiments, even with small training sets.


3.2  Quantifying  change  in  variance 

When an input $\tilde{x}$ is queried, we obtain the resulting
output $\tilde{y}$. When the new example $(\tilde{x}, \tilde{y})$ is added to the
training set, the variance of the model will change. We
wish to select $\tilde{x}$ optimally, such that the resulting
variance, denoted $\tilde{\sigma}^2_{\hat{y}}$, is minimized.

The network provides a (possibly inaccurate) estimate
of the distribution $P(\tilde{y}|\tilde{x})$, embodied in an estimate of
the mean ($\hat{y}|\tilde{x}$) and variance ($\sigma^2_{\hat{y}|\tilde{x}}$). Given infinite
computational power then, we could use these estimates to
stochastically approximate $\tilde{\sigma}^2_{\hat{y}}$, by drawing examples
according to our estimate of $P(\tilde{y}|\tilde{x})$, training on them, and
averaging the new variances. In practice though, we
must settle for a coarser approximation. Note that the
approximation in Equation 1 is independent of the actual
$y_i$ values of the training set; the dependence is implicit in
the choice of $\hat{w}$ that minimizes $S^2$. If $P(\tilde{y}|\tilde{x})$ conforms
to our expectations, $\hat{w}$ and $g(\cdot)$ will remain essentially
unchanged, allowing us to compute the new information
matrix $\tilde{A}$ as

$$\tilde{A} \approx A + \frac{1}{S^2} g(\tilde{x}) g(\tilde{x})^T . \tag{4}$$

From the new information matrix, we may compute the
new parameter variances, and from there, the new
output variances. By the matrix inversion lemma,

$$\tilde{A}^{-1} = \left[ A + \frac{1}{S^2} g(\tilde{x}) g(\tilde{x})^T \right]^{-1}
 = A^{-1} - \frac{A^{-1} g(\tilde{x}) g(\tilde{x})^T A^{-1}}{S^2 + g(\tilde{x})^T A^{-1} g(\tilde{x})} . \tag{5}$$
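The rank-one form of the matrix inversion lemma is easy to verify numerically. The following sketch compares a direct inverse of the updated matrix with the update computed from $A^{-1}$ alone; the 2x2 matrix, the vector `g` and the value of `s2` are arbitrary toy values, and the helper names are hypothetical.

```python
def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def rank_one_update_inverse(A_inv, g, s2):
    """Inverse of A + (1/s2) g g^T, computed from A^{-1} via the lemma."""
    n = len(g)
    u = [sum(A_inv[i][j] * g[j] for j in range(n)) for i in range(n)]  # A^{-1} g
    v = [sum(g[i] * A_inv[i][j] for i in range(n)) for j in range(n)]  # g^T A^{-1}
    denom = s2 + sum(gi * ui for gi, ui in zip(g, u))
    return [[A_inv[i][j] - u[i] * v[j] / denom for j in range(n)]
            for i in range(n)]

A = [[3.0, 0.0], [0.0, 2.0]]   # toy information matrix (assumption)
g = [1.0, 2.0]                 # toy sensitivity vector at the queried input
s2 = 0.5                       # toy residual

updated = [[A[i][j] + g[i] * g[j] / s2 for j in range(2)] for i in range(2)]
direct = inv2(updated)
via_lemma = rank_one_update_inverse(inv2(A), g, s2)
```

The practical point is that evaluating a candidate query never requires re-inverting the information matrix: one existing inverse serves every candidate.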

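The utility of querying at $\tilde{x}$ may be expressed in terms
of the expected change in the estimated output variance
$\sigma^2_{\hat{y}}$. The expected new output variance at reference point
$x_r$ is

$$\begin{aligned}
\langle \tilde{\sigma}^2_{\hat{y}|x_r} \rangle
 &= g(x_r)^T \tilde{A}^{-1} g(x_r) \\
 &= \sigma^2_{\hat{y}|x_r} - \frac{[g(x_r)^T A^{-1} g(\tilde{x})]^2}{S^2 + g(\tilde{x})^T A^{-1} g(\tilde{x})} \\
 &= \sigma^2_{\hat{y}|x_r} - \frac{\sigma^2_{\hat{y}|x_r \tilde{x}}}{S^2 + g(\tilde{x})^T A^{-1} g(\tilde{x})} ,
\end{aligned}$$

where $\sigma_{\hat{y}|x_r \tilde{x}}$ is defined as $g(x_r)^T A^{-1} g(\tilde{x})$. Thus, when
$\tilde{x}$ is queried, the expected change in output variance at
$x_r$ is

$$\langle \Delta\sigma^2_{\hat{y}|x_r} | \tilde{x} \rangle
 = \frac{\sigma^2_{\hat{y}|x_r \tilde{x}}}{S^2 + g(\tilde{x})^T A^{-1} g(\tilde{x})} . \tag{6}$$

We compute $\langle \Delta\sigma^2_{\hat{y}} | \tilde{x} \rangle$ as a stochastic approximation
from $\langle \Delta\sigma^2_{\hat{y}|x_r} | \tilde{x} \rangle$ for $x_r$ drawn from $P_X$. Reusing the
estimate $\overline{gg^T}$ from the previous section, we can write
the expectation of Equation 6 over $X$ as

$$\langle \Delta\sigma^2_{\hat{y}} | \tilde{x} \rangle_{X,Y}
 = \frac{g(\tilde{x})^T A^{-1} \overline{gg^T} A^{-1} g(\tilde{x})}{S^2 + g(\tilde{x})^T A^{-1} g(\tilde{x})} . \tag{7}$$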
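The expected change in variance at a single reference point can be checked against a brute-force recomputation. The sketch below again assumes the toy model $f_w(x) = w_0 + w_1 x$ with $g(x) = (1, x)$, training inputs at -1, 0 and 1, and $S^2 = 1$; all names and numbers are illustrative.

```python
def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def g(x):
    """Sensitivity of the toy model f_w(x) = w0 + w1 * x (assumption)."""
    return [1.0, x]

def expected_gain_at(A_inv, g_query, g_ref, s2):
    """Expected drop in output variance at x_r when the query input is added."""
    u = matvec(A_inv, g_query)                 # A^{-1} g(query)
    return dot(g_ref, u) ** 2 / (s2 + dot(g_query, u))

A = [[3.0, 0.0], [0.0, 2.0]]   # (1/S^2) sum g g^T for inputs -1, 0, 1 with S^2 = 1
S2 = 1.0
x_query, x_ref = 2.0, 1.0
A_inv = inv2(A)
predicted_drop = expected_gain_at(A_inv, g(x_query), g(x_ref), S2)

# Brute-force check: add the query to the information matrix and re-invert.
gq = g(x_query)
A_new = [[A[i][j] + gq[i] * gq[j] / S2 for j in range(2)] for i in range(2)]
old_var = dot(g(x_ref), matvec(A_inv, g(x_ref)))
new_var = dot(g(x_ref), matvec(inv2(A_new), g(x_ref)))
```

Averaging `expected_gain_at` over reference points drawn from the input distribution gives the stochastic approximation of the overall expected gain.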
3.3 Selecting an optimal $\tilde{x}$

Given Equation 7, the problem remains of how to select
an $\tilde{x}$ that maximizes it. One approach to selecting a next
input is to use selective sampling: evaluate a number of
possible random $\tilde{x}$, then choose the one with the highest
expected gain. This is efficient so long as the dimension
of the action space is small. For high-dimensional
problems, we may use gradient ascent to efficiently find good
$\tilde{x}$. Differentiating Equation 7 with respect to $\tilde{x}$ gives a
gradient

$$\frac{\partial}{\partial \tilde{x}} \langle \Delta\sigma^2_{\hat{y}} | \tilde{x} \rangle
 = \frac{2 g(\tilde{x})^T A^{-1} \left[ (S^2 + g(\tilde{x})^T A^{-1} g(\tilde{x})) \, \overline{gg^T}
   - \overline{gg^T} A^{-1} g(\tilde{x}) g(\tilde{x})^T \right] A^{-1}}
   {(S^2 + g(\tilde{x})^T A^{-1} g(\tilde{x}))^2} \, \frac{\partial g(\tilde{x})}{\partial \tilde{x}} . \tag{8}$$

We can "hillclimb" on this gradient to find an $\tilde{x}$ with a
locally optimal expected change in average output
variance.

It is worth noting that both of these approaches are
applicable in continuous domains, and therefore
well-suited to problems with continuous action spaces.
Furthermore, the gradient approach is effectively immune
to the overabundance of candidate actions in
high-dimensional action spaces.
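Selective sampling, the first of the two approaches, can be sketched as follows: draw random candidates, score each by its expected variance reduction averaged over reference points, and query the best. The sketch assumes the same toy linear model as before ($g(x) = (1, x)$); the candidate range, the reference points and the function names are hypothetical. Gradient ascent is not shown.

```python
import random

def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def g(x):
    """Sensitivity of the toy model f_w(x) = w0 + w1 * x (assumption)."""
    return [1.0, x]

def expected_gain(A_inv, x_query, ref_points, s2):
    """Expected variance reduction, averaged over the reference points."""
    u = matvec(A_inv, g(x_query))
    denom = s2 + dot(g(x_query), u)
    return sum(dot(g(x_r), u) ** 2 for x_r in ref_points) / (len(ref_points) * denom)

def select_query(A_inv, ref_points, s2, low, high, n_candidates=50, seed=0):
    """Evaluate random candidates and return the one with the highest gain."""
    rng = random.Random(seed)
    candidates = [rng.uniform(low, high) for _ in range(n_candidates)]
    gains = [expected_gain(A_inv, x, ref_points, s2) for x in candidates]
    best = max(range(n_candidates), key=gains.__getitem__)
    return candidates[best], gains[best], candidates, gains

train_x = [-1.0, 0.0, 1.0]
S2 = 1.0
A = [[sum(g(x)[i] * g(x)[j] for x in train_x) / S2 for j in range(2)]
     for i in range(2)]
refs = [-3.0, -1.5, 0.0, 1.5, 3.0]
x_best, gain_best, cands, gains = select_query(inv2(A), refs, S2, low=-3.0, high=3.0)
```

The cost grows with the number of candidates needed to cover the action space, which is why this approach suits low-dimensional problems.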


3.4  A  caveat:  greedy  optimality 

We have described a criterion for one-step, or greedy,
optimization. That is, each action/query is chosen to
maximize the change in variance on the next step, without
regard to how future queries will be chosen. The
globally optimal, but computationally expensive, approach
would involve optimizing over an entire trajectory of $m$
actions/queries. Trajectory optimization entails starting
with an initial trajectory, computing the expected gain
over it, and iteratively relaxing points on the trajectory
towards optimal expected gains (subject to other points
along the trajectory being explored). After the iteration
has settled, the first point in the trajectory is queried,
and the relaxation is repeated on the remaining part of
the trajectory. Experiments using this form of
optimization did not demonstrate measurable improvement, in
the average case, over the greedy method, so it appears
that trajectory optimization may not be worth the
additional computational expense, except in extreme
situations (see Sollich [1994] for a theoretical comparison of
greedy and globally-optimal querying).


4  Experimental  Results 

In this section, we describe two sets of experiments
using optimal experiment design for error minimization.
The first attempts to confirm that the gains predicted
by optimal experiment design may actually be realized in
practice, and the second applies OED to learning tasks
with static and dynamic constraints. All experiments
described in this section were run using feedforward
networks with a single hidden layer of 20 units. Hidden and
output units used the 0-1 sigmoid as a nonlinearity. All
runs were performed on the Xerion simulator [van Camp
et al., 1993] using the default weight update rule ("Rudi's
Conjugate Gradient" with "Ray's Line Search") with no
weight decay term.


4.1 Expected versus actual gain

It must be emphasized that the gains predicted by OED
are expected gains. These expectations are based on the
series of approximations detailed in the previous
section, which may compromise the realization of any actual
gain. In order for the expected gains to materialize, two
"bridges" must be crossed. First, the expected decrease
in model variance must be realized as an actual decrease
in variance. Second, the actual decrease in model
variance must translate into an actual decrease in model
MSE.
Figure 3: (top) Correlations between expected change
in output variance and actual change in output variance.
(bottom) Correlations between actual change in output
variance and change in mean squared error. Correlations
are plotted for a network with a single hidden layer of 20
units trained on 50 examples from the arm kinematics
task.


4.1.1 Expected decreases in variance → actual
decreases in variance

The translation from expected to actual changes in
variance requires coordination between the exploration
strategy and the learning algorithm: to predict how the
variance of a weight will change with a new piece of data,
the predictor must know how the weight itself (and its
neighboring weights) will change. Using a black box
routine like backpropagation to update the weights virtually
guarantees that there will be some mismatch between
expected and actual decreases in variance. Experiments
indicate that, in spite of this, the correlation between
predicted and actual changes in variance is relatively
good (Figure 3a).

4.1.2 Decreases in variance → decreases in
MSE

A more troubling translation is the one from model
variance to model correctness. Given the highly
nonlinear nature of a neural network, local minima may leave
us in situations where the model is very confident but
entirely wrong. Due to high confidence, the learner may
reject actions that would reduce its mean squared error
and explore areas where the model is correct, but has low
confidence. Evidence of this behavior is seen in the lower
right corner of Figure 3b, where some actions which
produce a large decrease in variance have little effect on
$E_{MSE}$. This behavior appears to be a manifestation of
the bias term discussed in Section 3; these queries reduce
variance while increasing the learner's bias, with no net
decrease in error. While this demonstrates a weak point
in the OED approach (which will be further illustrated
below), we find in the remainder of this section that its
effect is negligible for many classes of problems.

4.2  Querying  with  static  constraints 

Here we consider a simple learning problem with static
constraints: learning the forward kinematics of a planar
arm from examples. The input $X = \{\theta_1, \theta_2\}$
specified the arm's joint angles, and the learner attempted
to learn a map from these to the Cartesian coordinates
$Y = \{\zeta_1, \zeta_2\}$ of the arm's tip. The "shoulder" and
"elbow" joints were constrained to the 0° to 360° and 0° to 180°
ranges respectively; on each time step the learner was allowed
to specify an arbitrary $\tilde{x} \in X$ within those limits.
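For concreteness, the target mapping in this task can be written down directly. The sketch below gives the standard forward kinematics of a planar two-joint arm; the unit link lengths, the convention that the elbow angle is measured relative to the upper arm, and the function name are assumptions, since the arm's geometry is not specified here.

```python
import math

def forward_kinematics(theta1_deg, theta2_deg, l1=1.0, l2=1.0):
    """Tip position of a planar two-joint arm.

    theta1_deg: shoulder angle in degrees, measured from the positive x-axis.
    theta2_deg: elbow angle in degrees, measured relative to the upper arm
                (assumed convention).
    l1, l2:     link lengths (assumed to be 1.0 for illustration).
    """
    t1 = math.radians(theta1_deg)
    t2 = math.radians(theta2_deg)
    tip_x = l1 * math.cos(t1) + l2 * math.cos(t1 + t2)
    tip_y = l1 * math.sin(t1) + l2 * math.sin(t1 + t2)
    return tip_x, tip_y

stretched = forward_kinematics(0.0, 0.0)    # arm fully extended along x
upright = forward_kinematics(90.0, 0.0)     # arm fully extended along y
folded = forward_kinematics(0.0, 180.0)     # forearm folded back onto upper arm
```

With equal link lengths, a fully folded elbow brings the tip back to the shoulder, so large regions of joint space map to nearby tip positions; this is part of what makes the mapping nonlinear for the learner.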

For the greedy OED learner, $\tilde{x}$ was chosen by
beginning at a random point in $X$ and hillclimbing the
gradient of Equation 6 to a local maximum before querying.
This strategy was compared with simply choosing $\tilde{x}$ at
random, and choosing $\tilde{x}$ according to a uniform grid over
$X$.⁶

We compared the variance and MSE of the OED-based
learner with that of the random and grid learners.
The average variance of the OED-based learner was
almost identical to that of the grid learner and slightly
better than that of the random learner (Figure 4a). In
terms of MSE however, the greedy OED learner did not
fare as well. Its error was initially comparable to that of
the grid strategy, but flattened out at an error
approximately twice that of the asymptotic limit (Figure 4b).
This flattening appears to be a result of bias. As
discussed in Section 3, the network's error is composed of a
variance term and a bias term, and the OED-based
approach, while minimizing variance, appears (in this case)
to leave a significant amount of bias.

⁶Note the uniform grid strategy is not viable for
incrementally drawn training sets: the size of the grid must be
fixed before any examples are drawn. In these experiments,
entirely new training sets of the appropriate size were drawn
for each new grid.
[Figure 4 plots; horizontal axis: number of examples]

Figure 4: Querying with static constraints to learn the
kinematics of a planar two-joint arm. (top) Variance
using OED-based actions is better than that using
random queries, and matches the variance of a uniform
grid. (bottom) MSE using OED-based actions is
initially very good, but breaks down at larger training set
sizes. Curves are averages over six runs apiece for OED
and grid learners, and 12 runs for the random learner.
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4.3  Quoi-ying  with  dynamic  constraints 

For learning with dynamic constraints, we again used the
planar arm problem, but this time with a more realistic
restriction on new inputs. For the first series of
experiments, the learner learned the kinematics by
incrementally adjusting $\theta_1$ and $\theta_2$ from their
values on the previous query. The limits of allowable
movement on each step corresponded to constraints with
known dynamics. The second set of experiments involved
learning the dynamics of the same arm based on torque
commands. The unknown next state of the arm corresponded
to constraints with unknown dynamics.

4.3.1  Constraints  with  known  dynamics 

To learn the arm kinematics, the learner hillclimbed
to find the $\theta_1$ and $\theta_2$ within its limits of
movement that would maximize the stochastic approximation
of $\Delta\sigma^2_{\hat{y}}$. On each time step $\theta_1$
and $\theta_2$ were limited to change by no more than ±36°
and ±18° respectively.
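A minimal sketch of hillclimbing under these per-step movement limits is given below. Only the ±36° and ±18° limits come from the text; the joint range of 0° to 180°, the function names, and the toy gradient (which simply pulls toward a hypothetical peak at 90°, 90°) are assumptions made for illustration.

```python
import numpy as np

# Per-step limits on theta_1 and theta_2 (degrees), as stated in the text.
MAX_STEP_DEG = np.array([36.0, 18.0])

def reachable_box(theta_prev, joint_min, joint_max):
    """Intersect the per-step movement limits with the joint range."""
    lower = np.maximum(theta_prev - MAX_STEP_DEG, joint_min)
    upper = np.minimum(theta_prev + MAX_STEP_DEG, joint_max)
    return lower, upper

def constrained_hillclimb(criterion_grad, theta_prev, joint_min, joint_max,
                          step_size=1.0, iters=200):
    """Gradient ascent from the previous query, projected back into the
    reachable box after every step."""
    lower, upper = reachable_box(theta_prev, joint_min, joint_max)
    theta = theta_prev.astype(float).copy()
    for _ in range(iters):
        theta = np.clip(theta + step_size * criterion_grad(theta),
                        lower, upper)
    return theta

# Hypothetical criterion gradient pulling toward (90, 90) degrees.
def toy_grad(theta):
    return 0.1 * (np.array([90.0, 90.0]) - theta)

theta_next = constrained_hillclimb(
    toy_grad,
    theta_prev=np.array([0.0, 0.0]),
    joint_min=np.array([0.0, 0.0]),      # assumed joint range
    joint_max=np.array([180.0, 180.0]),  # assumed joint range
)
```

In this toy case the unconstrained optimum lies outside the reachable box, so the selected query sits on the boundary of the allowed movement.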

We compared variance and MSE of the OED-based
learner with that of an identical learner which explored
randomly by “flailing,” and with a learner trained on a
series of hand-tuned trajectories.

The greedy OED-based learner found exploration
trajectories that, intuitively, appear to give good global
coverage of the domain (see Figure 5). In terms of
performance, the average variance over the OED-based
trajectories was almost as good as that of the best
hand-tuned trajectory, and both were far better than that
of the random exploration trajectories. In terms of MSE,
the average error over OED-based trajectories was almost
as good as that of the best hand-tuned trajectory, and
again, both were far better than the random exploration
trajectories (Figure 6). Note that in this case, bias does
not seem to play a significant role. We discuss the
performance and computational complexity of this task in
greater detail in Section 5.


Figure 5: Querying with dynamic constraints: learning
2D arm kinematics. Example of OED-based learner’s
trajectory through angle-space.


Figure 6: Querying with dynamic constraints: learning
2D arm kinematics. (top) Variance using greedy OED
actions is better than that using random exploration,
and matches the variance of the best hand-tuned
trajectory. (bottom) MSE using greedy OED-based
exploration is much better than that of random exploration
and almost as good as that of the best hand-tuned
trajectory. Curves are averages over 5 runs apiece for
OED-based and random exploration.


4.3.2 Constraints with unknown dynamics

For this set of experiments, we once again used
the planar two-jointed arm, but now attempted to
learn the arm dynamics. The learner’s input
$X = \{\theta_1, \theta_2, \dot{\theta}_1, \dot{\theta}_2, \tau_1, \tau_2\}$
specified the joint positions, velocities and torques.
Based on these, the learner attempted to learn the arm’s
next state
$Y = \{\theta_1', \theta_2', \dot{\theta}_1', \dot{\theta}_2'\}$.
As with the kinematics experiment, we compared random
exploration with the greedy OED strategy described in the
previous section. Without knowing the dynamics of the
input constraints, however, we do not have the ability to
specify a preset trajectory.

The performance of the learner whose exploration was
guided by OED was asymptotically much better than
that of the learner following a random search strategy
(Figure 7). It is instructive to notice, however, that this
improvement is not immediate, but appears only after
the learner has taken a number of steps.^7 Intuitively,
this may be explainable by the assumptions made in the
OED formalism: the network uses its estimate of variance
of the current model to determine what data will
minimize the variance. Until there is enough data for the
model to become reasonably accurate, the estimates will
be correspondingly inaccurate, and the search for
“optimal” data will be misled. It would be useful to have a
way of determining at what point the learner’s estimates
become reliable, so that one could explore randomly at
first, then switch to OED-guided exploration when the
learner’s model is accurate enough to take advantage of
it.

^7 This behavior is visible in the other problem domains as
well, but is not as pronounced.

Figure 7: MSE of forward dynamic model for two-joint
planar arm.


5 Computational costs and approximations

The major concern with applying the OED techniques
described in this paper is computational cost. In this
section we consider the computational complexity of
selecting actions via OED techniques, and consider several
approximations aimed at reducing the computational costs.
These costs are summarized in Table 1, with the time
constants observed for runs performed on a Sparc 10.

We divide the learning process into three steps:
training, variance estimation, and data selection. We show
that, for the case examined, in spite of increased
complexity, the improvement in performance more than
warrants the use of OED for data selection.

Cost of training: Two training regimens were tested
for the OED-guided learners: batch training reinitialized
after each new example was added, and incremental
training, reusing the previous network’s weights after
each new example. While the batch-trained learners’
performance was slightly better, their total training time
was significantly longer than that of their incrementally
trained counterparts (Figure 8).


operation              | constant      | order
-----------------------+---------------+--------------
Batch train            | 0.029         | mn
Incremental train      | 0.093         | n
Compute exact A        | [illegible]   | [illegible]
Compute approx. A      | 7.2*10^-[?]   | mn^2
Invert to get A^-1     | 3.2*10^-[?]   | n^3
Compute var(x_r)       | 5.0*10^-[?]   | [illegible]
Compute E[var(X)|x]    | 5.4*10^-[?]   | rn^2
Compute gradient       | 1.9*10^-[?]   | rn^2

Table 1: Typical compute times, in seconds, for
operations involved in selecting new data and training.
Number of weights in network = n, number of training
examples = m, and number of reference points (at which
variance or gradient is measured) = r. Time constants
are for runs performed on a Sparc 10 using the Xerion
simulator.


Cost of variance estimation: (Equation 3) Variance
estimation requires computing and inverting the
Hessian. The inverse Hessian may then be used for an
arbitrary number of variance estimates and must only be
recomputed when the network weights are updated. The
approximate Hessian of Equation 1 may be computed in
time O(mn^2), but the major cost remains the inversion.
We have experimented with diagonal and block diagonal
Hessians, which may be inverted quickly, but without
the off-diagonal terms, the learner failed to generate
reasonable training sets. Recent work by Pearlmutter
[1994] offers a way to bring down the cost of computing
the first term of Equation 3, but computing the second
term remains an O(n^3) operation.
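The "invert once, reuse many times" pattern described above can be illustrated with a short sketch. Equations 1 and 3 are not reproduced in this section, so the forms used here are assumptions: an outer-product Hessian approximation A ≈ (1/S^2) Σ g g^T and a variance estimate g(x)^T A^-1 g(x). The gradients are random placeholders rather than gradients of a real network, and all function names are hypothetical.

```python
import numpy as np

def approx_hessian(G, noise_var):
    """Assumed outer-product approximation A ~= (1/S^2) * sum_i g_i g_i^T.

    G has shape (m, n): row i is the gradient of the network output with
    respect to the n weights at training input i. Cost is O(m n^2).
    """
    return G.T @ G / noise_var

def output_variance(A_inv, g):
    """Assumed variance estimate g^T A^-1 g at one input; O(n^2)."""
    return float(g @ A_inv @ g)

rng = np.random.default_rng(1)
m, n, r = 50, 8, 5                    # examples, weights, reference points
G_train = rng.normal(size=(m, n))     # placeholder gradients
A = approx_hessian(G_train, noise_var=0.1)
A_inv = np.linalg.inv(A)              # O(n^3); redo only when weights change

G_ref = rng.normal(size=(r, n))       # gradients at the reference points
variances = [output_variance(A_inv, g) for g in G_ref]
```

The inversion is the only cubic step here; each additional variance estimate reuses `A_inv` and costs a single quadratic form.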

Cost of data selection: (Equations 6, 7 and 8)
Computing Equation 6 is an O(n^2) operation, which
must be performed on each of r reference points, and
must be repeated for each candidate x. Alternatively,
the “moment-based” selection (Equation 7) and gradient
methods (Equation 8) both require an O(n^3) matrix
multiplication which must be done once, after which any
number of iterations may be performed with new x in
time O(n^2). Using Pearlmutter’s approach to directly
approximate A^-1 g(x) would allow an approximation of
Equation 7 to be computed in O(n) times an “accuracy”
constant. We have not yet determined what effect
this time/accuracy tradeoff has on network performance.
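To make the per-candidate cost concrete, the sketch below evaluates a predicted variance reduction at one reference point for one candidate query, and checks it against a brute-force recomputation. It assumes the standard linearized setting, in which adding a query with output gradient g changes the Hessian approximation to A + g g^T / S^2; under that assumption the rank-one (Sherman-Morrison) identity gives the reduction in closed form. Whether this matches Equation 6 term for term cannot be confirmed from this section, and the names and random gradients are illustrative only.

```python
import numpy as np

def expected_variance_reduction(A_inv, g_ref, g_new, noise_var):
    """Predicted drop in g_ref^T A^-1 g_ref after adding a query whose
    output gradient is g_new (rank-one update of A by g_new g_new^T / S^2):
        (g_ref^T A^-1 g_new)^2 / (S^2 + g_new^T A^-1 g_new)
    """
    u = A_inv @ g_new                       # one matrix-vector product
    return float((g_ref @ u) ** 2 / (noise_var + g_new @ u))

rng = np.random.default_rng(2)
n, noise_var = 6, 0.25
G_train = rng.normal(size=(40, n))          # placeholder gradients
A = G_train.T @ G_train / noise_var
A_inv = np.linalg.inv(A)
g_ref = rng.normal(size=n)
g_new = rng.normal(size=n)

predicted = expected_variance_reduction(A_inv, g_ref, g_new, noise_var)

# Brute-force check: rebuild and re-invert the updated matrix.
A_after = A + np.outer(g_new, g_new) / noise_var
actual = g_ref @ A_inv @ g_ref - g_ref @ np.linalg.inv(A_after) @ g_ref
```

The closed form avoids re-inverting the matrix for every candidate, which is what keeps repeated evaluation during hillclimbing affordable.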

The payoff: cost vs. performance. Obviously, the
OED-based approach requires significantly more
computation time than does learning from random examples.
The payoff comes when relative performance is
considered. We turn again to the kinematics problem discussed
in Section 4.3.1. The approximate total time involved in
training a learner on 100 random training examples from
this problem (as computed from Table 1) is 170 seconds.
For “full-blown” OED, using incremental training, the
total time is 790 seconds. As shown in Figure 8,
exploring randomly causes our MSE to decrease roughly
as an inverse polynomial, while the various OED
strategies decrease MSE roughly exponentially in the number
of examples. To achieve the MSE reached by training on
OED-selected data, we would need to train on
approximately 3380 randomly selected data examples. This
would take approximately 7500 seconds, over two hours!
With this much data, the training time alone is greater
than the total OED costs, so regardless of data costs,
selecting data via OED is the preferable approach.

Figure 8: Learning curves for the kinematics problem
from Section 4.2. Best fit functional forms are plotted
for random exploration, incrementally-trained OED and
OED completely retrained on new data set.

With the kinematics example there is the option of
hand-tuning a learning trajectory, which requires no
more data than the OED approach, and can nominally
be learned in less time. This, however, required hours of
human intervention to repeatedly re-run the simulations
trying different preset exploration trajectories. In the
dynamics example and in other cases where the state
transitions are unknown, preset exploration strategies
are not an option; we must rely on an algorithm for
deciding our next action, and the OED-based strategy
appears to be a viable, statistically well-founded choice.

6  Conclusions  and  Future  Work 

The experiments described in this paper indicate that,
for some tasks, optimal experiment design is a promising
tool for guiding active learning in neural networks.
It requires no arbitrary discretization of state or action
spaces, and is amenable to gradient search techniques.
The appropriateness of OED for exploration hinges on
the two issues described in the previous two sections: the
nature of the input constraints and the computational
load one is able to bear.

For learning problems with static constraints, the
advantage of applying OED, or any form of intelligent
active learning, appears to be problem dependent. Random
exploration appears to be reasonably good at decreasing
variance, and as seen in Section 4.2, appears to decrease
bias as well. For a problem where learner bias is likely to
be a major factor, the advantages of the OED approach
are unclear.

The real advantage of the OED-based approach
appears to lie in problems where the input constraints are
dynamic, and where random actions fail to provide good
exploration. Compared with arbitrary heuristics, the
OED-based approach has the arguable advantage of
being the “right thing to do,” in spite of its computational
costs.

The cost, however, is a major drawback. A decision
time on the order of 1-10 seconds may be sufficient for
many applications, but is much too long to guide
real-time exploration of dynamical systems such as robotic
arms. The operations required for Hessian computation
and data selection may be efficiently parallelized;
the remaining computational expense lies in retraining
the network to incorporate each new example. The
retraining cost, which is common to all on-line neural
exploration algorithms, may be amortized by selecting
queries/actions in small batches rather than purely
sequentially. This “semi-batched” approach is a promising
direction for future work.
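One hypothetical way to realize such a semi-batched scheme is sketched below. The paper does not specify how a batch would be chosen, so everything here is an assumption: queries are picked greedily from a finite candidate pool under a linearized model, the inverse Hessian approximation is refreshed after each pick with a rank-one (Sherman-Morrison) update instead of retraining, and the network would be retrained only once the whole batch has been queried. The gradients are random placeholders and `select_batch` is an illustrative name.

```python
import numpy as np

def select_batch(A_inv, G_ref, G_cand, noise_var, batch_size):
    """Greedily pick `batch_size` distinct rows of G_cand.

    Each pick maximizes the mean predicted variance reduction over the
    reference gradients G_ref. A_inv (assumed symmetric) is then updated
    by the rank-one formula so later picks account for earlier ones.
    """
    A_inv = A_inv.copy()
    available = np.ones(len(G_cand), dtype=bool)
    chosen = []
    for _ in range(batch_size):
        U = G_cand @ A_inv                           # row c is (A^-1 g_c)^T
        denom = noise_var + np.einsum("ij,ij->i", U, G_cand)
        gain = ((G_ref @ U.T) ** 2).mean(axis=0) / denom
        gain[~available] = -np.inf                   # never pick twice
        best = int(np.argmax(gain))
        chosen.append(best)
        available[best] = False
        A_inv -= np.outer(U[best], U[best]) / denom[best]
    return chosen

rng = np.random.default_rng(3)
n, noise_var = 5, 0.5
G_train = rng.normal(size=(30, n))                   # placeholder gradients
A_inv0 = np.linalg.inv(G_train.T @ G_train / noise_var)
G_ref = rng.normal(size=(10, n))
G_cand = rng.normal(size=(20, n))

batch = select_batch(A_inv0, G_ref, G_cand, noise_var, batch_size=4)
# The learner would query all four points, then retrain once.
```

The gradients used within a batch are those of the network before retraining, so the quality of later picks in a batch depends on how well the linearization holds.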

Another  promising  direction,  which  offers  hope  of 
even  greater  speedups  than  the  semi-batch  approach,  is 
switching  to  an  alternative,  entirely  non-neural  learner 
with  which  to  pursue  exploration. 

6.1 Improving performance with alternative learners

We may be able to bring down computational costs and
improve performance by using a different architecture
for the learner. With a standard feedforward neural
network, not only is the repeated computation of
variances expensive, it sometimes fails to yield estimates
suitable for use as confidence intervals (as we saw in
Section 4.1.2). A solution to both of these problems
may lie in selection of a more amenable architecture and
learning algorithm. Two such architectures, in which
output variances have a direct role in estimation, are
mixtures of Gaussians [McLachlan and Basford, 1988;
Nowlan, 1991; Ghahramani and Jordan, 1994] and
locally weighted regression [Cleveland et al., 1988; Schaal
and Atkeson, 1991]. Both have excellent statistical
modeling properties, and are computationally more tractable
than feedforward neural networks. We are currently
pursuing the application of optimal experiment design
techniques to these models and have observed encouraging
preliminary results [Cohn et al., 1994].

6.2 Active elimination of bias

Regardless of which learning architecture is used, the
results in Section 4.2 make it clear that minimizing
variance alone is not enough. For large, data-poor problems,
variance will likely be the major source of error, but as
variance is removed (via the techniques described in this
paper), the bias will constitute a larger and larger
portion of the remaining error.

Bias is not as easily estimated as variance; it is
usually estimated by expensive cross validation, or by
running ensembles of learners in parallel (see, e.g., Geman
et al. [1992] and Connor [1993]). Future work will need
to include methods for efficiently estimating learner bias
and taking steps to ensure that it too is minimized in an
optimal manner.

Appendix - System identification with neural networks and OED

System identification using OED has been successful on
tasks where the parameters of the unknown system are
explicit, but, for a neural network model, system
identification is problematic. The weights in the network
cannot be reasonably considered to represent real
parameters of the unknown system being modeled, so there
is no good interpretation of their “identity.” A greater
problem is the observation that unless the network is
fortuitously structured to be exactly the correct size,
there will be extra unconstrainable parameters in the
form of unused weights, about which it will be
impossible to gain information. Distinguishing between
unconstrainable parameters (which we wish to delete or
ignore) and underconstrained parameters (about which we
wish to get more information) is an unsolved problem.

Below, we review the derivation of the “D-optimality”
criterion appropriate for system identification [Fedorov,
1972; MacKay, 1992], and briefly discuss experiments
selecting D-optimal data.

When doing system identification with a neural network,
we are interested in minimizing the covariance of the
parameter estimates $\hat{w}$. For the purposes of
optimization, it is convenient to express
$\sigma^2_{\hat{w}}$ as a scalar. The most widely used
scalar is the determinant $D = |\sigma^2_{\hat{w}}|$, which
has an interpretation as the “volume” of parameter space
encompassed by the variance (for other approaches see
Atkinson and Donev [1992]).

The utility of querying at x, from a system identification
viewpoint, may be expressed in terms of the expected
change in the estimated value of D. The expected new
value $\tilde{D}$ is

$$\tilde{D} = |\tilde{A}^{-1}| = \frac{D\,S^2}{S^2 + g(x)^T A^{-1} g(x)} = \frac{D\,S^2}{S^2 + \sigma^2_{\hat{y}|x}} \qquad (9)$$

which, by subtraction from the original estimate D gives

$$\Delta D = \frac{D\,\sigma^2_{\hat{y}|x}}{S^2 + \sigma^2_{\hat{y}|x}}. \qquad (10)$$

Equation 10 is maximized where $\sigma^2_{\hat{y}|x}$ is at
a maximum, giving the intuitively pleasing interpretation
that for system identification, parameter uncertainty is
minimized by querying where our uncertainty is largest.
Such queries are, in OED terminology, “D-optimal.”

Our experiments using the above criterion to select
training data had limited success. On regression problems
such as the arm kinematics the learner performed poorly,
attempting to select data at $x = \pm\infty$. These results
are consistent with the comments at the beginning of this
section, and with MacKay’s observation that, for learning
$X \to Y$ mappings, the system identification criterion
may be the “right solution to the wrong problem”
[MacKay, 1992]. The criterion addressed in Section 3,
also mentioned by MacKay and explored in greater detail
in this paper, appears to address the “right” problem.
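The determinant update behind Equations 9 and 10 can be checked numerically. The sketch below assumes the linearized setting in which a query with output gradient g changes A to A + g g^T / S^2; the matrix determinant lemma then gives the new value of D directly. The gradients are random placeholders and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, noise_var = 6, 0.3                     # number of weights, S^2
G_train = rng.normal(size=(40, n))        # placeholder gradients
A = G_train.T @ G_train / noise_var
A_inv = np.linalg.inv(A)
g = rng.normal(size=n)                    # gradient at the candidate query

D = np.linalg.det(A_inv)                  # "volume" of parameter uncertainty
var_y = g @ A_inv @ g                     # output variance at the query

# Direct recomputation after adding the query.
A_new = A + np.outer(g, g) / noise_var
D_new_direct = np.linalg.det(np.linalg.inv(A_new))

# Closed forms corresponding to Equations 9 and 10.
D_new_formula = D * noise_var / (noise_var + var_y)
delta_D = D * var_y / (noise_var + var_y)
```

Since delta_D grows monotonically with var_y for fixed D and S^2, the candidate with the largest output variance gives the largest expected shrinkage of D.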
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